Background & Context: Cars4U

Daniyal Nathani


Context:

There is a huge demand for used cars in the Indian Market today. As sales of new cars have slowed down in the recent past, the pre-owned car market has continued to grow over the past years and is larger than the new car market now. Cars4U is a budding tech start-up that aims to find footholes in this market. In 2018-19, while new car sales were recorded at 3.6 million units, around 4 million second-hand cars were bought and sold. There is a slowdown in new car sales and that could mean that the demand is shifting towards the pre-owned market. In fact, some car sellers replace their old cars with pre-owned cars instead of buying new ones. Unlike new cars, where price and supply are fairly deterministic and managed by OEMs (Original Equipment Manufacturer / except for dealership level discounts which come into play only in the last stage of the customer journey), used cars are very different beasts with huge uncertainty in both pricing and supply. Keeping this in mind, the pricing scheme of these used cars becomes important in order to grow in the market.

As a senior data scientist at Cars4U, you have to come up with a pricing model that can effectively predict the price of used cars and can help the business in devising profitable strategies using differential pricing. For example, if the business knows the market price, it will never sell anything below it.

Objective:

  1. Explore and visualize the dataset.
  2. Build a linear regression model to predict the prices of used cars.
  3. Generate a set of insights and recommendations that will help the business.

Data Dictionary:

  1. S.No. : Serial Number
  2. Name : Name of the car which includes Brand name and Model name
  3. Location : The location in which the car is being sold or is available for purchase Cities
  4. Year : Manufacturing year of the car
  5. Kilometers_driven : The total kilometers driven in the car by the previous owner(s) in KM.
  6. Fuel_Type : The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG)
  7. Transmission : The type of transmission used by the car. (Automatic / Manual)
  8. Owner : Type of ownership
  9. Mileage : The standard mileage offered by the car company in kmpl or km/kg
  10. Engine : The displacement volume of the engine in CC.
  11. Power : The maximum power of the engine in bhp.
  12. Seats : The number of seats in the car.
  13. New_Price : The price of a new car of the same model in INR Lakhs.(1 Lakh = 100, 000)
  14. Price : The price of the used car in INR Lakhs (1 Lakh = 100, 000)

Best Practices for Notebook :

Concepts to Cover:

  1. Overview of the data
  2. Data Visualization
  3. Data Preparation
  4. Choose Model, Train and Evaluate
  5. Conclusion
  6. Linear Regression Assumptions

Let's start coding!

Importing necessary libraries

Importing dataset

First look at data

From a first glance of the first 5 rows, I see there are already some null values as well as units in the Mileage, Engine, Power, and New_Price columns that are going to needed to be removed.

With this visualization, we can see that a large portion of the New_Price values are completely missing. The same rows have the Engine, Power, & Seats data missing.

From here, we can see that the datatypes don't all align properly as numbers - primarily because they have units attached. The most startling data from this visualization is the amount of missing data (86%!) in the New Price column. We have to keep this in mind moving forward on our analysis.. may not be a good idea to include in the overall model.

From here, we see a little overview of the numerical data. The serial number seems evenly distributed so it can be dropped. Kilometers driven has a ridiculously high max value much further away from the rest of the data. Mileage, Engine, & Power numerical analysis will have to be done after we strip them and apply the proper datatype format.

Data Preprocssing, Cleaning, & Features Engineering

  1. Removing Serial Number column. This column won't be helping with the predictions.
  2. Investigating ridiculously high amount of kilometers driven.
  3. Investigating zero values of seats
  4. Extracting Brand & Model Columns while dropping Name Column
  5. Cleaning Data and Converting datatypes
  6. Let's deal with our missing values and then move on to dealing with outliers and distribution skewness.

  7. Outlier & Distrbution analysis using Univariate analysis

  1. Removing Serial Number column. This column won't be helping with the predictions.
  1. Investigating ridiculously high amount of kilometers driven.
  1. Investigating zero values of seats
  1. Extracting Brand & Model Columns while dropping Name Column

A little bit of analysis on the new extracted feature. There are 33 unique brands. Let's check how many models there are per brand.

We can see that a lot of the brands have only 1 car in this dataset. Maruti brand cars are the most seen in the dataset.

  1. Cleaning Data and Converting datatypes

5.1 Removing non-numerical parts in Milage, Engine and Power columns

5.2 Chaning New_price data according to the units. Most are in lakhs, but few are in Cr. (100 lakhs in 1 Cr)

Confirming the datatypes have been converted

  1. Let's deal with our missing values and then move on to dealing with outliers and distribution skewness.

After imputing all the null values in this database, we can get a more complete look at the data to help with the linear regression model. We started with ~7,200 rows and are down to about 6000 rows now. This was done to help with the prediction accuracy!

  1. Outliers and Distribution Skewness

From these, we can see that the Price, Kilometers drive, and Power have the most outliers present. The Skewness of Price, New_Price, and Kilometers driven is going to be needing some transformation done to help with the distribution.

'New_Price', 'Kilometers_Driven', and 'Price' all seem to be very skewed. These 3 would be most benefited from doing a log transformation or something similar to that. Let's take a closer look.

*To fix the distribution issue we are seeing here, there are three different approaches I'm going to attempt. A log transformation, archsinh transformation, and a square root transformation. From there, we'll be able to better determine which of these transformations we want to keep.

The log transformation for each of these looks to be the best transformation to apply.

Data Analysis using Visulizations

From here, we can see that the brands of the cars are significantly skewed because of being associated with a lot of models & cars in this dataset! This may be a feature we need to take out.

Correlations with respect to Price

Using the categorical variables and these correlations to gain more insight

The price of cars based on the type of fuel being used shows a great amount of distribution and outliers within Petrol! The price is significantly affected here due the Petrol, with Diesel right behind.

As expected, there has been a steady increase in the price of the vehicle as the years have gone by up until 2010. Afterwards, we see a steeper increase in prices with many more outliers- showing the average price hasn't even got close to catching up with the max!

The sale of vehicles after 4 or more owners significantly decreases, with a very low value! You are most likely to sell your car as a First owner for the best price. There is one outlier as the third owner of a vehicle was able to sell his vehicle for a very high price!

As the power of the vehicle goes up, the price of the car also goes up. First owners of cars usully have better price value, but there is no proper distribution when grouped by on Owner_type

*Brand Maruti tends to have the weakest engines as well as the lowest price points of their vehicles! The more powerful, more expensive vehicles belong to Porsche, Lamborghini, and other sport brands!

Petrol vehicles have the strongest strong engines in this dataset, but most of the Petrol vehicles are on the lowest end of engine output. The Diesel fuel type has a lot of values in the 2000-3000 range for engine output

Automatic cars tend to more expensive, and only automatic cars have engine outputs of over 3500! Manual cars are on the lower end of the value scale

We have seen a lot of relationships on the Univariate and Bivariate graphs. It's almost time to start with the model!

Outliers Treatment!

*I chose to treat the outliers by flooring and capping the outter quartile ranges

Model Building

We are going to drop Model and New Price from the modeling set. The car model is addressed and taken into account with the Brand.

With New Price of the car, the data had 86% of missing values. Even with imputations, this results in much of data being fabricated based on only 14% of the total rows of data. This is widely inaccurate and would cause improper predictions.

Dummy Encoding

We need to take care of the categorical values so that they can contribute to the model efficiently. I am using the dummy encoding approach to do this

Splitting the data for training/testing at 70/30!

Training and Evaluating the Lienar Regression Model for Price

From the correlations, we can see that having a Bently will increase your price value the most! The coefficient seen as the Datsun brand is the lowest, meaning this will hurt your car value price the most

The mean absolute error (MAE) is the simplest regression error metric to understand. We'll calculate the residual for every data point, taking only the absolute value of each so that negative and positive residuals do not cancel out. We then take the average of all these residuals. Effectively, MAE describes the typical magnitude of the residuals.

-The RMSE is low on both sets, and shows very little change between the test/train scores. This drives show that this model is a good fit with comparable RMSE values.

-With a low MAE value, we can understand the typical magnitude of the residuals is very low, meaning the variability is low.

-With an R^2 showing about 95.5%, the model shows decent performance overall The train and test data is very close as well.

-We need to check the linear regression assumptions and confirm we followed the rules. The most important part!

Checking the Linear Regression Assumptions

  1. No Multicollinearity
  2. Mean of residuals should be 0
  3. No Heteroscedacity
  4. Linearity of variables
  5. Normality of error terms

1 - No multicollinearity - Checking the VIF scores of each before feature selection. Must be below a threshold of 5.0!

With many scores above the common threshold of 5, we need to dive deeper into getting this fixed.

Almost all of the Brands have an extremely high amount of colinearity. Removing the brand with the highest colinearity from the dataset would be the best approach in this case. Brand_Maruti will be removed VIF scores reevaluated.

*Dropping the Brand Maruti column helps tremendously. Now we need to focus on Fuel_Type_Diesel, as well as Engine & Power.

Dropping the Fuel_Type_Petrol would be the best option here as it affects the rmse the least.

Engine and power are the only ones that need to be treated

Dropping the Engine would be the best approach as it affects the predictability in the best way.

The condition is now satisfied! No VIF scores are above 5, showing no multicollinearity!

2. Mean of residuals should be 0

This value is very close to zero and thus, satisfies the condition of residual mean value being basically zero.

3. TEST FOR LINEARITY

Why the test?

How to check linearity?

4. TEST FOR HOMOSCEDASTICITY

5. TEST FOR NORMALITY using the QQ plot and distribution.

According to the QQ Plot and distributions, we can determine that the test for Normality has been satisfied.

This means all 5 of the linear regression model conditions have been satisfied and we can move forward.

Updating the test and train data with the new columns after going through the assumptions.

Performance Check on the Final Model

These values are are a small decrease in comparison to the original training performance, but all the assumptions are satisfied! The linear regression has been fit well for this dataset and can predict with a high accuracy

The amount of seats in a car seems to be a big factor in how much the car is worth! The more seats you have, the cheaper your car gets.

The overall performance of this model was pretty strong. The New_Price column had a large amount of missing values so it was not used in the model as it would have been misleading to the model due to correlation.

If we want to be specific and see which columns and how many yield the best results, we can go forth with feature selection!

Feature Selection for Further Imrpovement